Abstract

When computing a confidence interval for a binomial proportion $p$ one must
choose between using an exact interval, which has a coverage probability
of at least $1 - \alpha$ for all values of $p$, and a shorter approximate interval,
which may have lower coverage for some $p$ but that on average has coverage
equal to $1 - \alpha$. We investigate the cost of using the exact one- and two-sided
Clopper-Pearson confidence intervals rather than shorter approximate intervals,
first in terms of increased expected length and then in terms of the
increase in sample size required to obtain a desired expected length. Using
asymptotic expansions, we also give a closed-form formula for determining
the sample size for the exact Clopper-Pearson methods. For two-sided
intervals, our investigation reveals an interesting connection between the
frequentist Clopper-Pearson interval and Bayesian intervals based on
noninformative priors.

Keywords: Asymptotic expansion; binomial distribution; confidence interval;
expected length; sample size determination; proportion.

1 Introduction 

Inference for a binomial proportion p is one of the most commonly encountered 
statistical problems, with important applications in areas such as clinical trials, risk 
analysis and quality control. Consequently, a large number of two-sided confidence 
intervals and one-sided confidence bounds for p have been proposed by different 
authors. These are of two different types: exact methods, that have a coverage at
least equal to $1 - \alpha$ for all $p \in (0, 1)$, and approximate methods, that may have
coverage less than $1 - \alpha$ for some values of $p$, but that have a coverage that in
some sense is approximately equal to $1 - \alpha$.

Research on confidence intervals and bounds for a binomial proportion has
mostly focused on approximate intervals. In the methodological literature, exact
intervals have often been deemed to be too conservative (Agresti & Coull, 1998;
Brown et al., 2001; Newcombe & Nurminen, 2011), as they tend to be quite wide
and have actual coverage levels that often are noticeably greater than $1 - \alpha$.
Nevertheless, the use of exact intervals for proportions is abundant among practitioners:
see e.g. Abramson et al. (2013), Ibrahim et al. (2013), Ward et al. (2013) and
Sullivan et al. (2013) for some recent examples. By far the most widely used exact
interval is the Clopper-Pearson interval, introduced by Clopper & Pearson (1934).

The benefit of using an exact interval is obvious: one does not risk that the 
actual coverage falls below $1 - \alpha$. For this reason, some regulatory authorities
require that exact intervals be used. Moreover, the binomial distribution is unusual 
in that we often can be sure that it is an accurate description of that which we 
are modelling and not just an approximation to the true distribution, as is often 
the case when continuous distributions are used for modelling. In such a situation, 
using an exact method seems reasonable. But there are also costs associated 
with the use of such an interval. When choosing between approximate and exact 
confidence methods, there is a trade-off in that exact intervals and bounds by 
construction are wider than the best approximate intervals, or equivalently, require 
a larger sample size in order to obtain a certain expected length. If one is unwilling 
to accept intervals and bounds with undercoverage for some values of p, there is a 
cost to pay in terms of expected length or required sample size. This paper seeks 
to quantify these costs. 

In planned experiments, it is always important to determine a suitable sample
size. Sample size determination for binomial confidence intervals has received
much attention in recent years (Katsis, 2001; Piegorsch, 2004; Krishnamoorthy &
Peng, 2007; M'Lan et al., 2008), with
different authors studying different intervals and methods for sample size calculations,
the latter often of a computer-intensive nature. The first main contribution
of this paper is closed-form formulas for computing the sample size required for
the Clopper-Pearson methods to obtain a given expected length. This eliminates
the need for computer-intensive methods for computing sample sizes and gives a
better understanding of how the desired length and the parameters $p$ and $\alpha$ affect
the sample size.

The second main contribution is closed-form expressions for the excess length 
and increase in required sample size that comes from using the exact Clopper- 
Pearson methods instead of approximate methods. We obtain these expressions by 
deriving asymptotic expansions for the exact Clopper-Pearson methods, extending 
the work of Brown et al. (2002), Cai (2005) and Staicu (2009) on the asymptotics 
of approximate binomial confidence methods to exact intervals and bounds. 

The rest of the paper is organised as follows. In Section 2 we introduce the
Clopper-Pearson methods along with other exact and approximate confidence
methods. In Section 3 we give an asymptotic expression for the expected length of
the Clopper-Pearson interval. This allows us to give a formula for computing the
sample size, and to determine the cost of using an exact interval rather than an
approximate interval, in terms of expected length and sample size. In Section 4 we
discuss the one-sided Clopper-Pearson bound and give expressions for its expected
distance to $p$ and the cost of using an exact bound. In Section 5 we discuss costs
associated with approximate intervals and state some conclusions. All proofs and
technical details are deferred to an appendix.



2 Binomial confidence methods 

2.1 The Clopper-Pearson interval and bounds 

The two-sided Clopper-Pearson interval for a proportion $p$ is an inversion of the
equal-tailed binomial test: the interval contains all values of $p$ that are not rejected
by the test at significance level $\alpha$. Given an observation $X$, the lower limit is thus
given by the value of $p_L$ such that

\[ \sum_{k=X}^{n} \binom{n}{k} p_L^{k} (1 - p_L)^{n-k} = \alpha/2 \tag{1} \]

and the upper limit is given by the $p_U$ such that

\[ \sum_{k=0}^{X} \binom{n}{k} p_U^{k} (1 - p_U)^{n-k} = \alpha/2. \tag{2} \]

As is well-known, the computation of $p_L$ and $p_U$ is simplified by the following
equality from Johnson et al. (2005). Let $f(t, a, b)$ be the density function of a
Beta$(a, b)$ random variable. Then

\[ \sum_{k=X}^{n} \binom{n}{k} p^{k} (1 - p)^{n-k} = \int_0^{p} f(t, X, n - X + 1)\,dt. \tag{3} \]



When (3) is plugged into (1) and (2), the problem of finding $p_L$ and $p_U$ reduces to
inverting the distribution functions of two beta distributions. Consequently, the
endpoints of the Clopper-Pearson interval are given by quantiles of beta distributions:

\[ (p_L, p_U) = \big(B(\alpha/2, X, n - X + 1),\ B(1 - \alpha/2, X + 1, n - X)\big), \tag{4} \]

where $B(\alpha, a, b)$ denotes the $\alpha$-quantile of the Beta$(a, b)$ distribution.

When $X$ is either $0$ or $n$, closed-form expressions for the interval bounds are
available. When $X = 0$ the interval is $(0, 1 - (\alpha/2)^{1/n})$ and when $X = n$ it
is $((\alpha/2)^{1/n}, 1)$. For other values of $X$, (4) must be evaluated numerically. The
interval is implemented in most statistical software packages; it can for instance be
found in the PropCIs package in R and computed using the PROC FREQ command
in SAS.
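As a minimal sketch of how the interval can be evaluated numerically without a beta quantile routine, the defining equations (1) and (2) can be solved directly by bisection. The function names below are illustrative and not taken from any package, and the direct summation of binomial terms is only intended for moderate $n$.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p); direct summation, moderate n only."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def solve(f, target, increasing, tol=1e-12):
    """Bisection on (0, 1) for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided 1 - alpha Clopper-Pearson interval from equations (1)-(2)."""
    # Lower limit: P(X >= x | p_L) = alpha/2, which is increasing in p.
    lower = 0.0 if x == 0 else solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper limit: P(X <= x | p_U) = alpha/2, which is decreasing in p.
    upper = 1.0 if x == n else solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper


if __name__ == "__main__":
    print(clopper_pearson(3, 20))
    # Boundary case X = 0: compare with the closed form 1 - (alpha/2)^(1/n).
    print(clopper_pearson(0, 20)[1], 1 - 0.025 ** (1 / 20))
```

For $X = 0$ and $X = n$ the bisection reproduces the closed-form endpoints given above, which gives a simple sanity check of the implementation.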



Some authors (Agresti & Coull, 1998; Brown et al., 2001) have argued that
when choosing between confidence intervals, it is often preferable to use an interval
with a simple closed-form formula rather than one that requires numerical
evaluation, as the former is easier to present and to interpret. Next, we give
asymptotic expansions of $p_L$ and $p_U$ that function as good approximations when
$n > 40$, and can be used if a closed-form formula for the Clopper-Pearson interval
is desired. As an example, when $n = 50$ the upper bound is accurate up to two
decimal places for $X \notin \{0, 1, 2, n\}$.

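Theorem 1. Let $X \in \{1, 2, \ldots, n - 1\}$ be fixed. Let $\hat{p} = X/n$, $\hat{q} = 1 - \hat{p}$ and let $z_{\alpha/2}$
be the upper $\alpha/2$ quantile of the standard normal distribution.
The bounds of the Clopper-Pearson interval are, up to $O(n^{-3/2})$,

\[ p_L = \hat{p} - n^{-1/2} z_{\alpha/2} (\hat{p}\hat{q})^{1/2} + (3n)^{-1}\big(2(1/2 - \hat{p}) z_{\alpha/2}^{2} - (1 + \hat{p})\big) \]

and

\[ p_U = \hat{p} + n^{-1/2} z_{\alpha/2} (\hat{p}\hat{q})^{1/2} + (3n)^{-1}\big(2(1/2 - \hat{p}) z_{\alpha/2}^{2} + 1 + \hat{q}\big). \]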
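The closed-form approximation in Theorem 1 is straightforward to evaluate. The following sketch (the function name is illustrative, not from the paper) computes both approximate bounds; note that their difference is $2 n^{-1/2} z_{\alpha/2} (\hat{p}\hat{q})^{1/2} + n^{-1}$, since the two $(3n)^{-1}$ terms differ by exactly $3/(3n)$.

```python
from math import sqrt
from statistics import NormalDist


def cp_asymptotic(x, n, alpha=0.05):
    """Approximate Clopper-Pearson bounds from Theorem 1, for 0 < x < n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 normal quantile
    p_hat = x / n
    q_hat = 1 - p_hat
    half = z * sqrt(p_hat * q_hat / n)        # n^{-1/2} z (p_hat q_hat)^{1/2}
    skew = 2 * (0.5 - p_hat) * z**2           # 2 (1/2 - p_hat) z^2
    lower = p_hat - half + (skew - (1 + p_hat)) / (3 * n)
    upper = p_hat + half + (skew + 1 + q_hat) / (3 * n)
    return lower, upper


if __name__ == "__main__":
    print(cp_asymptotic(25, 50))
```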



Similar in construction to the two-sided interval, the one-sided Clopper-Pearson
bounds are obtained by inverting one-sided binomial tests. Thus the $1 - \alpha$
Clopper-Pearson upper bound $p_U$ is given by the $p_U$ such that

\[ \sum_{k=0}^{X} \binom{n}{k} p_U^{k} (1 - p_U)^{n-k} = \alpha. \tag{5} \]



In the following, we limit our study to upper bounds. For symmetry reasons,
the results are however equally valid for lower bounds, as for the bounds under
consideration, a lower bound $p_L$ for $p$ is equivalent to an upper bound for $q$, as
$q_U = 1 - p_L$.

If a closed-form expression for $p_U$ is desired, it can be obtained in the form of
an asymptotic expansion by replacing $\alpha/2$ with $\alpha$ in Theorem 1 above.



2.2 Other exact intervals 

In much of the medical literature, as well as the rest of the present paper, the
Clopper-Pearson interval is referred to as the exact confidence interval for a
binomial proportion. Despite this terminology, several other exact intervals have
been proposed throughout the years. These alternative intervals do not admit
closed-form expressions and are, to varying extents, computer-intensive.

There are several reasons why the Clopper-Pearson interval is the most
widely used exact interval. One is simply tradition and availability: it has found
its way into classic statistical textbooks and has been implemented in almost all
statistical software packages. Compared to the computer-intensive alternatives,
the Clopper-Pearson interval is also considerably simpler computationally. Finally,
it remains a natural choice in that it is the inversion of the well-known equal-tailed
binomial test.

In the two-sided case, however, there is room for improvement, at least if one 
is willing to let go of some natural properties of confidence intervals. Other exact 
intervals have been designed to be shorter than the Clopper-Pearson interval, 
by inverting two-sided tests that need not be equal-tailed. Moreover, the coverage
probabilities of these intervals often fluctuate less from $1 - \alpha$ than does the coverage
of the Clopper-Pearson interval.



The Blyth-Still-Casella interval (Blyth & Still, 1983; Casella, 1986) is guaranteed
to be the shortest exact interval, but has the odd property that it is not
nested, in the sense that the 90 % interval need not be contained in the 95 %
interval (Blaker, 2000, Theorem 2). This is also true for the intervals of Crow
(1956).



The Sterne (1954) procedure yields nested intervals that are shorter than the 
Clopper-Pearson interval, but will in some cases result in two separate intervals 
rather than one connected interval. Blaker (2000) proposed a nested exact interval 
that, while wider than the Blyth-Still-Casella interval, always is contained in the 
Clopper-Pearson interval. It is however sometimes a union of disjoint intervals and 
its upper bound is decreasing but not strictly decreasing in $\alpha$ when $n$ and $X$ are
fixed (Vos & Hudson, 2008). The interval based on the inverted exact likelihood 
ratio test suffers from similar problems (Vos & Hudson, 2008). 

The Clopper-Pearson interval, in contrast, is nested, is always a connected set 
and has bounds that are strictly monotone in $\alpha$. While it is possible to obtain
shorter exact confidence intervals for a binomial proportion, this seems to be associated
with the loss of nestedness, connectedness or monotonicity. As we consider
these properties to be of importance, we will only include the Clopper-Pearson 
interval and bounds in the following sections, and will out of convenience refer to 
them as the exact methods. 

Implementations of some of the alternative exact intervals are readily available. 
The Blyth-Still-Casella interval has been implemented in StatXact and Blaker
(2000) gave an S-PLUS function for his interval. A more efficient implementation
of Blaker's interval is found in the R package BlakerCI (Klaschka, 2010).

2.3 Approximate confidence intervals and bounds



Throughout the text, the Clopper-Pearson methods will be compared to several
well-known approximate methods. These are described below, along with the
commonly used Wald interval. For more thorough reviews of binomial confidence
methods, see Newcombe (2012), Cai (2005) and Brown et al. (2001, 2002). In the
descriptions below, $\hat{p} = X/n$ is the sample proportion, $\hat{q} = 1 - \hat{p}$ and $z_{\alpha/2}$ is the
$100(1 - \alpha/2)$th percentile of the standard normal distribution.

The Wald interval. Inversion of the large sample test $|(\hat{p} - p)(\hat{p}\hat{q}/n)^{-1/2}| \le z_{\alpha/2}$
leads to the Wald interval, which is presented in virtually every introductory
statistics course: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$. The Wald interval suffers from particularly
erratic coverage properties, and cannot be recommended for general use (Brown
et al., 2001; Newcombe, 2012).

The Wilson score interval. Like the Wald interval, the Wilson (1927) score
interval is based on an inversion of the large sample normal test $|(\hat{p} - p)/d(p)| \le z_{\alpha/2}$,
where $d(p)$ is the standard error of $\hat{p}$. Unlike the Wald interval, however, the
inversion is obtained using the null standard error $(p(1 - p)/n)^{1/2}$ instead of the
sample standard error. The solution of the resulting quadratic equation leads to
the confidence interval

\[ \frac{X + z_{\alpha/2}^{2}/2}{n + z_{\alpha/2}^{2}} \pm \frac{z_{\alpha/2}\, n^{1/2}}{n + z_{\alpha/2}^{2}} \sqrt{\hat{p}\hat{q} + \frac{z_{\alpha/2}^{2}}{4n}}. \]

The Wilson score interval has favourable coverage and length properties and has
been recommended for general use (Brown et al., 2001; Newcombe, 2012).



The Agresti-Coull interval. For 95% nominal coverage, Agresti & Coull (1998)
proposed the use of the Wald interval with two successes and two failures added,
i.e. with $n$ replaced by $n + 4$ and $X$ replaced by $X + 2$. More generally, let
$\tilde{n} = n + z_{\alpha/2}^{2}$, $\tilde{X} = X + z_{\alpha/2}^{2}/2$, $\tilde{p} = \tilde{X}/\tilde{n}$ and $\tilde{q} = 1 - \tilde{p}$. Brown et al. (2001)
dubbed the interval $\tilde{p} \pm z_{\alpha/2}\sqrt{\tilde{p}\tilde{q}/\tilde{n}}$ the Agresti-Coull interval. It has performance
close to that of the Wilson interval, but is somewhat simpler to use.
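The three closed-form intervals described so far can be sketched in a few lines. The function names are illustrative only. The Wilson and Agresti-Coull intervals share the same centre $(X + z_{\alpha/2}^{2}/2)/(n + z_{\alpha/2}^{2})$, which the usage lines below make visible.

```python
from math import sqrt
from statistics import NormalDist


def z_quantile(alpha):
    """Upper alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald(x, n, alpha=0.05):
    z, p = z_quantile(alpha), x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson(x, n, alpha=0.05):
    z, p = z_quantile(alpha), x / n
    centre = (x + z**2 / 2) / (n + z**2)
    half = z * sqrt(n) / (n + z**2) * sqrt(p * (1 - p) + z**2 / (4 * n))
    return centre - half, centre + half


def agresti_coull(x, n, alpha=0.05):
    z = z_quantile(alpha)
    n_t = n + z**2            # adjusted sample size
    p_t = (x + z**2 / 2) / n_t  # adjusted proportion
    half = z * sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half


if __name__ == "__main__":
    for name, f in [("Wald", wald), ("Wilson", wilson),
                    ("Agresti-Coull", agresti_coull)]:
        print(name, f(10, 50))
```

Note that the Wald interval degenerates to a single point when $X = 0$ or $X = n$, which is one symptom of its poor behaviour near the boundaries.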

Bayesian Beta intervals and bounds. Let $B(\alpha, a, b)$ denote the $\alpha$-quantile of
the Beta$(a, b)$ distribution. An equal-tailed Bayesian credible interval based on the
Beta$(a, b)$ prior is given by $(B(\alpha/2, X + a, n - X + b),\ B(1 - \alpha/2, X + a, n - X + b))$.
Similarly, an upper bound is given by $B(1 - \alpha, X + a, n - X + b)$. As these methods make
use of beta quantiles, they are algebraically very similar to the Clopper-Pearson
interval. This connection is discussed further in Section 3.4.

The Jeffreys interval and bound. A commonly used Bayesian interval for $p$ is the
Jeffreys interval $(B(\alpha/2, X + 1/2, n - X + 1/2),\ B(1 - \alpha/2, X + 1/2, n - X + 1/2))$,
which is the equal-tailed credible interval derived using the noninformative Jeffreys
prior. Both the two-sided interval and the one-sided bound exhibit favourable
frequentist properties (Brown et al., 2001; Newcombe, 2012; Cai, 2005).



The second-order correct bound. Cai (2005) proposed a coverage-corrected version
of the one-sided Wald bound, based on second-order asymptotic expansions.
Cai (2005) recommended it for general use and gave a closed-form expression for
the bound.

The modified loglikelihood root bound. Staicu (2009) studied the bound obtained
by inverting the modified loglikelihood root test and found it to have very
favourable coverage and length properties. It cannot be expressed in a closed form,
but Staicu (2009) gave asymptotic expansions that can be used as approximations.



3 Two-sided intervals 
3.1 Expected length 

Let $q = 1 - p$ and let $L_{CP} = p_U - p_L$ denote the length of the Clopper-Pearson
interval. Next, we present an asymptotic expression for the expectation of $L_{CP}$.

Theorem 2. As $n \to \infty$ the expected length of the $1 - \alpha$ Clopper-Pearson interval
is

\[ E(L_{CP}) = 2 z_{\alpha/2} n^{-1/2} (pq)^{1/2} + n^{-1} + n^{-3/2} z_{\alpha/2} \big(\ldots - 17pq - \ldots\, pq\, z_{\alpha/2}^{2}\big) + O(n^{-2}). \tag{6} \]



The expansion (6) is compared to the actual expected length in Figure 1. Even
for small values of $n$, the approximation comes quite close to the actual expected
length over the entire parameter space.
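A comparison of this kind can be sketched as follows. The exact expected length is obtained by summing the interval length over the binomial distribution of $X$, with the interval endpoints found by bisection on the binomial tail probabilities. As a simplifying assumption, only the first two terms of the expansion, $2 z_{\alpha/2} n^{-1/2}(pq)^{1/2} + n^{-1}$, are used here, so the agreement is somewhat cruder than with the full expansion. All function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(x, n, p):
    return sum(pmf(k, n, p) for k in range(x + 1))


def root(f, target, increasing, tol=1e-10):
    """Bisection on (0, 1) for f(p) = target, f monotone in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cp_length(x, n, alpha):
    """Length of the two-sided Clopper-Pearson interval for observed x."""
    lower = 0.0 if x == 0 else root(
        lambda p: 1 - cdf(x - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if x == n else root(
        lambda p: cdf(x, n, p), alpha / 2, increasing=False)
    return upper - lower


def expected_length_exact(n, p, alpha=0.05):
    """E_p(L_CP): average the length over X ~ Binomial(n, p)."""
    return sum(pmf(x, n, p) * cp_length(x, n, alpha) for x in range(n + 1))


def expected_length_second_order(n, p, alpha=0.05):
    """First two terms of the expansion of E(L_CP)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * sqrt(p * (1 - p) / n) + 1 / n


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, expected_length_exact(n, 0.3),
              expected_length_second_order(n, 0.3))
```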



Figure 1: Comparison between the actual expected length and the expansion (6)
for the nominal 95 % Clopper-Pearson interval. Panels: $n = 10$, $n = 20$, $n = 40$.
Curves: exact and asymptotic.

Having an expression for the expected length of the Clopper-Pearson interval
allows us to evaluate its performance for different combinations of $n$, $p$ and $\alpha$.
When planning an experiment, this is extremely useful as it can be used to determine
what sample size we need in order to achieve a desired expected length.
Methods for determining sample size are discussed next.



3.2 Sample size determination 

Several different criteria can be considered when determining sample size, as
discussed e.g. by Gonçalves et al. (2012). We focus on a comparatively simple
criterion: for a fixed confidence level $1 - \alpha$ we wish to find the smallest sample
size $n$ such that the expected length of the confidence interval is at most some fixed value
$d$. As the value of $n$ will depend on $p$, we require that an initial guess $p_0$ for $p$ is
available.



Studying the Clopper-Pearson interval, Krishnamoorthy & Peng (2007) gave
a first-order approximation of $E(L_{CP})$ in the form of beta quantiles and used that
to numerically calculate the sample size required to obtain a desired expected
length $d$. Ignoring the higher terms of the expansion (6) we obtain the second-order
approximation $E(L_{CP}) \approx 2 z_{\alpha/2} n^{-1/2} (pq)^{1/2} + n^{-1}$, which can be evaluated
analytically. Given an initial guess $p_0$ for $p$, with $q_0 = 1 - p_0$, the equation
$2 z_{\alpha/2} n^{-1/2} (p_0 q_0)^{1/2} + n^{-1} = d$ has the solution

\[ n = \frac{2 z_{\alpha/2}^{2} p_0 q_0 + 2 z_{\alpha/2} \sqrt{z_{\alpha/2}^{2} p_0^{2} q_0^{2} + d p_0 q_0} + d}{d^{2}} \tag{7} \]



when rounded up to the nearest integer. This is a good approximation of the
actual required sample size, with a small positive bias. At the 95 % level it
typically does not differ by more than 4 from the solution obtained by more complicated
(and computer-intensive) exact numerical computations. For $p$ close to 1/2, the
Krishnamoorthy-Peng method is slightly more accurate, whereas for $p$ close to
0 or 1, (7) gives a better approximation. In either case, both approximations
are accurate enough for most applications. As an example, when $p_0 = 0.05$ and
$d = 0.05$, the actual required sample size is 329, while our approximation yields
$n = 331$, corresponding to an actual expected length of 0.0498. In comparison
with exact methods or the Krishnamoorthy-Peng procedure, (7) offers greater
computational ease without sacrificing much accuracy.
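The closed-form sample size can be evaluated directly. In the sketch below (function names are illustrative), the formula follows from treating $2 z_{\alpha/2} n^{-1/2}(p_0 q_0)^{1/2} + n^{-1} = d$ as a quadratic equation in $n^{-1/2}$ and taking the positive root.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cp_sample_size_raw(d, p0, alpha=0.05):
    """Unrounded solution of 2 z sqrt(p0 q0 / n) + 1/n = d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = p0 * (1 - p0)  # p0 * q0
    return (2 * z**2 * v + 2 * z * sqrt(z**2 * v**2 + d * v) + d) / d**2


def cp_sample_size(d, p0, alpha=0.05):
    """Sample size rounded up to the nearest integer."""
    return ceil(cp_sample_size_raw(d, p0, alpha))


if __name__ == "__main__":
    # Example from the text: p0 = 0.05 and d = 0.05.
    print(cp_sample_size(0.05, 0.05))
```

With $p_0 = 0.05$ and $d = 0.05$ this evaluates to about 330.7 before rounding, in agreement with the value $n = 331$ quoted above.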

It is likewise possible to solve the cubic equation that results from including
the $n^{-3/2}$-term of (6), but the solution does not yield a simple formula and does
not give substantially improved accuracy.

A downside to this approach to sample size determination is that the initial
guess $p_0$ may be quite wrong. This is particularly problematic if $p$ is closer to 1/2
than is $p_0$, in which case the calculated required sample size will be too small.
As a safety measure, it is sometimes recommended to use the conservative guess
$p_0 = 1/2$, which maximizes the required sample size. More often than not, however,
this choice is needlessly conservative.

An alternative approach, with a Bayesian flavour, is to use a prior distribution
for $p$ when determining the sample size. Beta distributions constitute a flexible
and analytically tractable class of priors for $p$. For $p \sim$ Beta$(a, b)$, we have

\[ E\big(2 z_{\alpha/2} n^{-1/2} (pq)^{1/2} + n^{-1}\big) = 2 z_{\alpha/2} n^{-1/2}\, \frac{\Gamma(a + 1/2)\Gamma(b + 1/2)}{(a + b)\Gamma(a)\Gamma(b)} + n^{-1}. \]

With $R(a, b) = \Gamma(a + 1/2)\Gamma(b + 1/2)\,[(a + b)\Gamma(a)\Gamma(b)]^{-1}$, this gives the required
sample size

\[ n = \frac{2 z_{\alpha/2}^{2} R^{2}(a, b) + 2 z_{\alpha/2} \sqrt{z_{\alpha/2}^{2} R^{4}(a, b) + d R^{2}(a, b)} + d}{d^{2}}. \]

When applying a frequentist procedure, the prior information about $p$ is typically
diffuse, indicating that a low-informative prior should be used so as not to bias
the sample size determination. One example is the Jeffreys prior Beta$(1/2, 1/2)$,
which puts more probability mass close to 0 and 1 and yields $R(1/2, 1/2) = 1/\pi$.
Other examples include the uniform Beta$(1, 1)$ prior, for which we have
$R(1, 1) = \pi/8$, and the Beta$(2, 2)$ prior, which puts more mass close to 1/2,
yielding $R(2, 2) = 9\pi/64$.
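The quantity $R(a, b)$ and the resulting sample size are easy to evaluate with the gamma function. The sketch below uses illustrative function names and checks the three values of $R$ quoted above.

```python
from math import ceil, gamma, pi, sqrt
from statistics import NormalDist


def R(a, b):
    """E[(pq)^{1/2}] under a Beta(a, b) prior for p."""
    return gamma(a + 0.5) * gamma(b + 0.5) / ((a + b) * gamma(a) * gamma(b))


def cp_sample_size_prior(d, a, b, alpha=0.05):
    """Sample size from the second-order approximation averaged over the prior."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r2 = R(a, b) ** 2
    n = (2 * z**2 * r2 + 2 * z * sqrt(z**2 * r2**2 + d * r2) + d) / d**2
    return ceil(n)


if __name__ == "__main__":
    print(R(0.5, 0.5), 1 / pi)       # Jeffreys prior
    print(R(1, 1), pi / 8)           # uniform prior
    print(R(2, 2), 9 * pi / 64)      # Beta(2, 2) prior
    print(cp_sample_size_prior(0.1, 0.5, 0.5))
```

Since $R(1/2, 1/2) < R(1, 1) < R(2, 2) < 1/2$, the prior-based sample sizes are ordered accordingly and are all smaller than the conservative choice $p_0 = 1/2$ would give.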

The required sample size for different combinations of $p$ and $\alpha$ is shown in
Figure 2. It is decreasing in $\alpha$, increasing in $p$ when $p < 0.5$ and decreasing in $p$
when $p > 0.5$.



Figure 2: The required sample size for the Clopper-Pearson interval for different
combinations of $p$ and $\alpha$. Panels: varying $p$ with $\alpha = 0.05$; prior on $p$ with
$\alpha = 0.05$; $p = 0.25$ with varying $\alpha$. Horizontal axis: desired expected length.



Remark. In formulas similar to those above, some authors use $d$ to denote
the expected half-length, or error tolerance, of a confidence interval. This may be
inappropriate in the binomial setting, since using the half-length might give the
false impression that all confidence intervals are symmetric about the unbiased
estimator $\hat{p} = X/n$. This is not the case for the Clopper-Pearson interval and
most good approximate intervals, including those presented in Section 2.3. As an
example, when $n = 50$ and $p = 0.01$, the expected length of the Clopper-Pearson
interval is 0.044. Since the interval is boundary respecting, most of its length will
be placed above $p$. The expected length is very much an interesting quantity when
determining sample size, but for binomial proportions it should not be interpreted
in terms of error tolerances.



3.3 The cost of using the exact interval 



Next, we will study the cost of using the exact Clopper-Pearson interval instead
of an approximate interval. We will do so by comparing the exact interval to three
of the approximate intervals described in Section 2.3: the Wilson score, Jeffreys
and Agresti-Coull intervals. These intervals have been recommended as default
intervals for a single proportion by several authors (Agresti & Coull, 1998; Brown
et al., 2001; Newcombe, 2012).



First, we measure the cost in terms of increased expected length. By comparing
the expansion in Theorem 2 to the expansions in Theorem 7 of Brown et al.
(2002), we get the following expressions for how much the expected length of the
confidence interval increases when the Clopper-Pearson interval is used instead of
an approximate interval.

Corollary 1. The Clopper-Pearson interval is asymptotically wider than the approximate
intervals described in Section 2.3. In particular, compared to the length
$L_J$ of the Jeffreys interval,

\[ E(L_{CP}) = E(L_J) + n^{-1} + O(n^{-3/2}) \tag{8} \]

and if $L_A$ denotes the length of the Wilson or Agresti-Coull interval,

\[ E(L_{CP}) = E(L_A) + n^{-1} + O(n^{-3/2}). \tag{9} \]



Expanded versions of (9) for the different intervals, including the $n^{-3/2}$-terms, are
given in the proof in the appendix.

Up to $O(n^{-3/2})$, the increase in expected length is inversely proportional to
$n$. Note that, up to $O(n^{-3/2})$, the increase does not depend on $p$ or $\alpha$. The cost
of using an exact interval, in terms of expected length, is thus more or less constant
for a fixed $n$. This is an interesting and somewhat unexpected fact, since the
expected lengths of these confidence intervals are highly dependent on both $p$ and $\alpha$.



Next, we consider required sample size. As the Clopper-Pearson interval is wider
than the approximate intervals, it naturally requires larger sample sizes to obtain
a particular expected length $d$. Let $n_{CP}(d, p, \alpha)$ be the minimum sample size for
which $E_p(L_{CP}) \le d$ at the $1 - \alpha$ level. Similarly, let $n_J(d, p, \alpha)$ be the minimum
sample size for which the expected length of the Jeffreys interval is at most $d$ under
$p$ at the $1 - \alpha$ level.

Figure 3: The increase in required sample size when using the Clopper-Pearson
interval instead of the Jeffreys, Wilson score and Agresti-Coull intervals, as
approximated by (10)-(12). Panels: increase for the Jeffreys interval; increase for
the Wilson score interval; increase for the Agresti-Coull interval. Horizontal axis:
desired expected length.



As noted by Piegorsch (2004), the sample size for the Jeffreys interval is well
approximated by $n_J(d, p_0, \alpha) = 4 z_{\alpha/2}^{2} p_0 q_0 d^{-2}$. Comparing this to (7) without
rounding, the increase in required sample size $n_J^{*}(d, p_0, \alpha) = n_{CP}(d, p_0, \alpha) - n_J(d, p_0, \alpha)$
can be approximated by

\[ n_J^{*}(d, p_0, \alpha) \approx \frac{d - 2 z_{\alpha/2}\Big(z_{\alpha/2} p_0 q_0 - \sqrt{(z_{\alpha/2} p_0 q_0)^{2} + d p_0 q_0}\Big)}{d^{2}}. \tag{10} \]

This approximation is quite accurate, generally differing by less than 1 when
compared to the value for $n_J^{*}$ obtained using substantially more computer-intensive
exact computations.



Equation (10) is plotted as a function of $d$ for three choices of $p_0$ in Figure 3. When shorter
intervals are desired, the increase in required sample size can be substantial. When
$d = 0.05$, for instance, $n_J^{*}$ is approximately 40 for $0.05 < p < 0.95$.

As was the case for the expected length, the increase $n_J^{*}$ is remarkably insensitive
to $p$ and $\alpha$: there is no discernible difference when $0.05 < p < 0.95$ and
$0.001 < \alpha < 0.2$. The cost of using an exact interval instead of the Jeffreys interval
is, in terms of required sample size, approximately constant for a fixed expected length $d$.
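The insensitivity to $p$ can be checked numerically. The following sketch evaluates the approximate increase as the difference between the unrounded closed-form Clopper-Pearson sample size and the Jeffreys approximation $4 z_{\alpha/2}^{2} p_0 q_0 d^{-2}$; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def increase_vs_jeffreys(d, p0, alpha=0.05):
    """Approximate extra sample size needed by the Clopper-Pearson interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    v = p0 * (1 - p0)  # p0 * q0
    return (d - 2 * z * (z * v - sqrt((z * v) ** 2 + d * v))) / d**2


if __name__ == "__main__":
    # The increase for d = 0.05 over a range of initial guesses p0.
    for p0 in (0.05, 0.1, 0.25, 0.5):
        print(p0, round(increase_vs_jeffreys(0.05, p0), 2))
```

For $d = 0.05$ and $\alpha = 0.05$ the computed increase stays between roughly 38.8 (at $p_0 = 0.05$) and 39.8 (at $p_0 = 1/2$).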



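Moving on to the Wilson score interval, Piegorsch (2004) gave the following
formula for its sample size:

\[ n_{WS}(d, p_0, \alpha) = z_{\alpha/2}^{2} \Big[p_0 q_0 - d^{2}/2 + \sqrt{p_0^{2} q_0^{2} + d^{2}(p_0 - 1/2)^{2}}\Big] \big[d^{2}/2\big]^{-1}. \]

The increase $n_{WS}^{*}(d, p_0, \alpha) = n_{CP}(d, p_0, \alpha) - n_{WS}(d, p_0, \alpha)$ can thus be
approximated by

\[ n_{WS}^{*}(d, p_0, \alpha) \approx \frac{d(1 + d z_{\alpha/2}^{2}) + 2 z_{\alpha/2}\sqrt{z_{\alpha/2}^{2} p_0^{2} q_0^{2} + d p_0 q_0} - 2 z_{\alpha/2}\sqrt{z_{\alpha/2}^{2} p_0^{2} q_0^{2} + d^{2} z_{\alpha/2}^{2}(p_0 - 1/2)^{2}}}{d^{2}}. \tag{11} \]

The approximation is good when $p_0$ is not very small, typically not differing by
more than 2 from the exact value.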



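Similarly, Piegorsch (2004) gave the formula $n_{AC}(d, p_0, \alpha) = 4 z_{\alpha/2}^{2} p_0 q_0 d^{-2} - z_{\alpha/2}^{2}$
for the sample size of the Agresti-Coull interval. Consequently, the increase
$n_{AC}^{*}(d, p_0, \alpha) = n_{CP}(d, p_0, \alpha) - n_{AC}(d, p_0, \alpha)$ is approximately

\[ n_{AC}^{*}(d, p_0, \alpha) \approx \frac{d + z_{\alpha/2}^{2}(d^{2} - 2 p_0 q_0) + 2 z_{\alpha/2}\sqrt{(z_{\alpha/2} p_0 q_0)^{2} + d p_0 q_0}}{d^{2}}. \tag{12} \]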



The expressions (11) and (12) are plotted for some combinations of $p$ and $\alpha$ in
Figure 3. For the Agresti-Coull interval, the cost is more or less constant in $p$,
but is sensitive to changes in $\alpha$. For the Wilson score interval, the cost depends
on both $p$ and $\alpha$.
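The increases for the Wilson score and Agresti-Coull intervals can be evaluated in the same way as for the Jeffreys interval. The sketch below rests on one assumption that should be stated plainly: the Wilson sample size is taken to be the root of $2 z n^{1/2} (n + z^{2})^{-1} \sqrt{p_0 q_0 + z^{2}/(4n)} = d$, i.e. the Wilson length evaluated at $\hat{p} = p_0$, and the Agresti-Coull sample size solves $2 z \sqrt{p_0 q_0/(n + z^{2})} = d$. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def n_cp(d, p0, z):
    """Unrounded closed-form Clopper-Pearson sample size."""
    v = p0 * (1 - p0)
    return (2 * z**2 * v + 2 * z * sqrt(z**2 * v**2 + d * v) + d) / d**2


def n_wilson(d, p0, z):
    """Wilson score sample size (assumed form, see the lead-in)."""
    v = p0 * (1 - p0)
    return z**2 * (v - d**2 / 2 + sqrt(v**2 + d**2 * (p0 - 0.5) ** 2)) / (d**2 / 2)


def n_agresti_coull(d, p0, z):
    v = p0 * (1 - p0)
    return 4 * z**2 * v / d**2 - z**2


def increase_wilson(d, p0, z):
    v = p0 * (1 - p0)
    return (d * (1 + d * z**2) + 2 * z * sqrt(z**2 * v**2 + d * v)
            - 2 * z**2 * sqrt(v**2 + d**2 * (p0 - 0.5) ** 2)) / d**2


def increase_agresti_coull(d, p0, z):
    v = p0 * (1 - p0)
    return (d + z**2 * (d**2 - 2 * v) + 2 * z * sqrt((z * v) ** 2 + d * v)) / d**2


if __name__ == "__main__":
    z = NormalDist().inv_cdf(0.975)
    for p0 in (0.1, 0.25, 0.5):
        print(p0, increase_wilson(0.1, p0, z), increase_agresti_coull(0.1, p0, z))
```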



3.4 The exact frequentist interval and Bayesian credible 
intervals with noninformative priors 



Equation (8) in Corollary 1 and the fact that (10) is so insensitive to $p$ and $\alpha$
reveal a strong connection between the frequentist Clopper-Pearson interval and
the Bayesian credible interval derived under the Jeffreys prior. In the light of these
results, it seems natural to think of the Bayesian interval as a sort of
continuity-correction of the Clopper-Pearson interval, in which conservativeness is sacrificed
in order to get a short interval.

Attempts to connect the exact frequentist interval with Bayesian intervals have
previously been made by Brown et al. (2001), who argued that the Jeffreys interval
can be thought of as a continuity-corrected version of the Clopper-Pearson interval.
Their argument comes from a comparison between the Jeffreys interval and the
mid-p interval, which generally is considered to be a continuity-corrected
Clopper-Pearson interval. However, the key step in their argument is their equation (17),
which is incorrect; it relies on the false assumption that for two continuous functions
$f_1$ and $f_2$, $(f_1 + f_2)^{-1} = f_1^{-1} + f_2^{-1}$.

Another natural noninformative Bayesian interval is that based on the uniform
prior, Beta$(1, 1)$. The Clopper-Pearson interval is essentially this interval after
the prior information has been removed, a fact which we have not seen mentioned
before in the literature. To see this, note that for a central Bayesian interval
with prior Beta$(a, b)$, $a, b > 0$, the lower bound is given by the beta quantile
$p_{L,B}(a, b, X, n) = B(\alpha/2, X + a, n - X + b)$. The parameters $a$ and $b$ can be
interpreted as additional successes and failures added to the data. For the uniform
prior, $a = b = 1$. The lower bound of the Clopper-Pearson interval is similarly
the beta quantile $B(\alpha/2, X, n - X + 1)$. When $X \notin \{0, n\}$ this can be written as
$B(\alpha/2, (X - 1) + 1, (n - 1) - (X - 1) + 1) = p_{L,B}(1, 1, X - 1, n - 1)$, the lower bound
of the Beta$(1, 1)$ interval with one success and one failure removed. Expressed in
words, the lower bound of the Clopper-Pearson interval equals the lower bound
of the Bayesian interval with the uniform prior after the prior information has
been removed. Similarly, the upper bound is $1 - p_{L,B}(1, 1, n - X - 1, n - 1)$, i.e. 1
minus the lower bound for $q$ under the uniform prior with one success and one
failure removed. The Beta$(1, 1)$ interval can thus be thought of as a shrinkage
Clopper-Pearson interval.



4 One-sided bounds 



4.1 Expected distance to the true proportion 

For one-sided confidence bounds, it is not the expected length that is of interest, 
but how close the bound is to $p$. Let $L_{U,CP} = p_U - p$ denote the distance from
$p_U$ to $p$. The next theorem gives an asymptotic expansion for the expectation of
$L_{U,CP}$.

Theorem 3. As $n \to \infty$ the expected distance to $p$ for the $1 - \alpha$ one-sided
Clopper-Pearson upper bound is

\[ E(L_{U,CP}) = n^{-1/2} z_{\alpha} (pq)^{1/2} + (3n)^{-1}\big(2(1/2 - p) z_{\alpha}^{2} + 1 + q\big) + \ldots \tag{13} \]



The expansion (13) is compared to the actual expected distance to $p$ in Figure 4.
Like the expansion for the expected length of the two-sided interval, (13) is
close to the actual expected distance even for small $n$.
Figure 4: Comparison between the actual expected distance and the expansion
(13) for the nominal 95 % Clopper-Pearson upper bound. Panels: $n = 10$, $n = 20$,
$n = 40$.

4.2 Sample size determination

The expressions we obtain in the one-sided case are not quite as simple as those in
the two-sided case. Let $d$ denote the desired expected distance to $p$ and let $p_0$ be
the initial guess for the value of $p$. Proceeding as before, using the second-order
approximation

\[ E(L_{U,CP}) \approx n^{-1/2} z_{\alpha} (pq)^{1/2} + (3n)^{-1}\big(2(1/2 - p) z_{\alpha}^{2} + 1 + q\big) \]

yields the required sample size

\[ n = (18 d^{2})^{-1}\Big(9 z_{\alpha}^{2} p_0 q_0 + 3 z_{\alpha}\sqrt{3 p_0 q_0}\,\sqrt{3 z_{\alpha}^{2} p_0 q_0 + 4\big[d z_{\alpha}^{2} - 2 d z_{\alpha}^{2} p_0 + d(1 + q_0)\big]} + 6 d\big[2 z_{\alpha}^{2}(1/2 - p_0) + (1 + q_0)\big]\Big). \]

This approximation is very good when d is not too small. For smaller d it has 
a small negative bias: when $\alpha = 0.05$ and $p_0 = 1/2$ the actual required sample size
for d = 0.02 is n = 1738, whereas the above expression gives the approximation 
n = 1721, corresponding to a true expected distance of d = 0.0201. For most 
purposes, this will probably be a sufficiently accurate approximation. 

As in the two-sided case, we may consider using a prior distribution of p,
rather than a fixed $p_0$, to determine a reasonable sample size. The expectation of
the second-order approximation with respect to a Beta(a, b) prior for p is

$$\frac{(2+z_\alpha^2)\,\Gamma(2-a)\Gamma(2-b)}{3n\,\Gamma(4-a-b)} - \frac{(2z_\alpha^2+1)\,\Gamma(3-a)\Gamma(2-b)}{3n\,\Gamma(5-a-b)} + \frac{z_\alpha\,\Gamma(5/2-a)\Gamma(5/2-b)}{\sqrt{n}\,\Gamma(5-a-b)}. \qquad (14)$$

Note that this expression is undefined when a, b > 2, limiting which priors we can
use. When (14) is well-defined, a general formula for the required sample size can
be obtained by equating (14) to d and solving for n, but the resulting expression is
rather complicated. It is however readily evaluated for particular values of a and
b. For the Jeffreys prior, for instance, the required sample size is

$$n \approx \frac{z_\alpha\left(z_\alpha + \sqrt{z_\alpha^2 + 9\pi d}\right)}{72d^2} + \frac{\pi}{16d}.$$

The solutions for the Jeffreys and uniform priors as well as the low-informative
asymmetric Beta(1/2, 1) prior are shown in Figure 5 along with the solutions for
fixed p and different values of $\alpha$.

In contrast to the two-sided case, d can in fact be interpreted as an error 
tolerance for the one-sided bound. This makes the interpretation of d easier in 
this case. 



Figure 5: The required sample size for the upper Clopper-Pearson bound for
different combinations of p and $\alpha$.
4.3 The cost of using the exact bound



The cost of using the exact bound will be evaluated in relation to three approximate
bounds: the Jeffreys, second-order correct and modified loglikelihood root bounds,
described in Section 2.3. Comparing (13) to the expansions in Corollary 1 of Cai
(2005) and Proposition 2.2 of Staicu (2009), the following corollary is immediate.



Corollary 2. When $L_{U,A}$ denotes the distance of the Jeffreys, second-order correct
or modified loglikelihood root bounds,

$$E(L_{U,CP}) = E(L_{U,A}) + (2n)^{-1} + O(n^{-3/2}).$$

It should be noted that there are one-sided versions of the Wald and Wilson
score intervals, but since these have very poor performance (Cai, 2005) they are
omitted from our comparison. They can however readily be compared to the
Clopper-Pearson bound by comparing (13) to the corresponding expansions in
Corollary 1 of Cai (2005).
For one-sided bounds, the approximation of the increased sample size when
the exact bound is used is more involved than it was for the two-sided cases.
To preserve space, we simply use the naive first-order formula $n = z_\alpha^2 pq d^{-2}$ to
determine the sample sizes for the approximate bounds. This works reasonably
well most of the time. Let $n_+(d, p_0, \alpha)$ be the increase in sample size when the
Clopper-Pearson bound is used instead of an approximate bound. Then, with
$\omega(z,d,p) = 9z^2pq + 12dz^2 - 24dz^2p$,



y/u(z a ,d,p ) + 12d(l + g ) - y/u{z a , d ^Po) + 12d(l/2 -p ) 



d 



n + (d,p ,a) « — z -. 

(15) 



Compared to the increased sample size in the two-sided setting, (15) is more
sensitive to changes in p and $\alpha$. The cost is the smallest when p = 0.5. When
evaluating the increased sample size, $p_0 = 0.5$ is therefore not to be recommended
as the default choice, as this can lead to a serious underestimation of the increase,
especially for smaller d.



Figure 6: The increase in required sample size when using the upper
Clopper-Pearson bound instead of an approximate upper bound, as approximated by (15)
for $\alpha = 0.05$.

5 Discussion

5.1 Minimum coverage or mean coverage? 

The Clopper-Pearson methods are exact in the sense that their minimum coverage
over all p is at least $1-\alpha$. An alternative measure of coverage is mean coverage,
which typically is taken to be the expected coverage with respect to a uniform
pseudo-prior of p. In recent papers on binomial confidence intervals, approximate
methods have often been considered to be preferable to exact methods (Agresti
& Coull, 1998; Brown et al., 2001; Cai, 2005; Newcombe & Nurminen, 2011), the
argument being that it makes more sense to interpret the confidence level as the
mean coverage probability rather than the minimum coverage probability, as this
corresponds better to how many modern-day statisticians think of coverage levels.
Reasoning along the lines of Newcombe & Nurminen (2011), the minimum coverage
can occur in an uninteresting part of the parameter space, typically close to the
boundaries, possibly rendering it an uninteresting measure of coverage. This is
discussed further in the next section.
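
The two notions of coverage can be made concrete with a small computation. The sketch below is an illustration under stated assumptions rather than a reference implementation: it uses only the Python standard library, obtains the two-sided Clopper-Pearson limits by bisection on the binomial tail probabilities, and approximates the minimum and the mean (uniform-weighted) coverage on a grid of midpoints. The helper names and the grid size are illustrative choices.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p), via logs; assumes 0 < p < 1."""
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))


def bisect(f, target, increasing):
    """Solve f(p) = target on (0, 1) for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cp_interval(x, n, alpha):
    """Two-sided Clopper-Pearson interval for x successes out of n."""
    # Lower limit: P(X >= x; p) = alpha/2, an increasing function of p.
    lower = 0.0 if x == 0 else bisect(
        lambda p: sum(binom_pmf(k, n, p) for k in range(x, n + 1)), alpha / 2, True)
    # Upper limit: P(X <= x; p) = alpha/2, a decreasing function of p.
    upper = 1.0 if x == n else bisect(
        lambda p: sum(binom_pmf(k, n, p) for k in range(x + 1)), alpha / 2, False)
    return lower, upper


def coverage_summary(n, alpha, grid_size=500):
    """Minimum and mean coverage over a midpoint grid on (0, 1)."""
    intervals = [cp_interval(x, n, alpha) for x in range(n + 1)]
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    cov = [sum(binom_pmf(x, n, p)
               for x, (lo, hi) in enumerate(intervals) if lo <= p <= hi)
           for p in grid]
    return min(cov), sum(cov) / len(cov)


if __name__ == "__main__":
    print(coverage_summary(20, 0.05))
```

A finite grid can miss the exact location of the minimum, so the first value is an upper estimate of the true minimum coverage.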

As noted e.g. by Newcombe & Nurminen (2011), using mean coverage is very
much in line with current statistical practice in other problems. Widely used
methods based on bootstrapping and MCMC, for instance, typically only control
confidence levels and type I error rates approximately, attaining the $1-\alpha$ level
only on average. This is particularly reasonable when the model is known to be an
imperfect representation of the underlying process, in which case even minimum
coverage criteria are approximate at best. Unlike in many other applications,
however, one can often be rather certain that a random variable truly is binomial.
This raises the question of whether one should resort to approximations or use
methods that really are guaranteed to be exact.

If the Bayesian credible intervals based on either the Jeffreys Beta(1/2, 1/2)
or the uniform Beta(1, 1) priors are used, an additional argument for the mean
coverage criterion is given by the Bayesian interpretation of these intervals. If we
accept mean coverage as a criterion when choosing between confidence intervals,
we can obtain intervals that simultaneously admit both frequentist and objective
Bayesian interpretations.

The minimum coverage criterion underlying the Clopper-Pearson interval is in
line with classical statistical theory. It asserts that overcoverage is a less serious
problem than undercoverage, or, in other words, that it is better to be more
confident than you think you are than to be overconfident. Next, in order
to evaluate this argument further, we will discuss just how overconfident one risks
being when using approximate intervals.



5.2 The cost of using approximate methods 

Just as there are costs associated with using exact methods, there are costs
associated with using approximate methods: the actual coverage level may, even for
large n, drop below the nominal $1-\alpha$. There is no guarantee that the true p is
not in an unfortunate area with low coverage. However, these coverage anomalies
usually occur close to the boundaries of the parameter space, so unless we are
interested in inference for p close to 0 or 1, it may be more relevant to
investigate the minimum over a central subset, such as [0.1, 0.9].

Figure 7: Minimum coverage of two-sided approximate intervals over
$p \in [0.01, 0.99]$ or $p \in [0.1, 0.9]$ when $\alpha = 0.05$, computed over a grid of 200,000
equidistant points. Panels: Jeffreys interval, Wilson score interval, Agresti-Coull
interval.

The problem of undercoverage is illustrated in Figure 7, in which the minimum
coverages of the Jeffreys, Wilson and Agresti-Coull intervals are shown for different
n when the minimum is taken over either $p \in [0.01, 0.99]$ or $p \in [0.1, 0.9]$. For
$p \in [0.01, 0.99]$ and a moderately large sample size of n = 250, the minimum coverage
of the Jeffreys interval is approximately 0.88, whereas the minimum coverage of
the Wilson score interval is about 0.93. The Agresti-Coull interval fares somewhat
better, with a minimum coverage of 0.94. In this setting neither the Jeffreys nor
the Wilson score interval has a minimum coverage above 0.94 even for a sample
size as large as n = 2000.
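
A computation of this kind can be sketched for the Wilson score interval, which has a closed form. The code below is a minimal illustration using only the Python standard library. The grid is far coarser than the 200,000 points used for Figure 7, so the minimum it reports can be higher than the true minimum, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p), via logs; assumes 0 < p < 1."""
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))


def wilson_intervals(n, alpha):
    """Wilson score intervals for x = 0, ..., n."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    denom = 1.0 + z * z / n
    out = []
    for x in range(n + 1):
        p_hat = x / n
        centre = (p_hat + z * z / (2.0 * n)) / denom
        half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
        out.append((centre - half, centre + half))
    return out


def coverage(p, n, intervals):
    """Coverage probability at p: total mass of outcomes whose interval contains p."""
    return sum(binom_pmf(x, n, p)
               for x, (lo, hi) in enumerate(intervals) if lo <= p <= hi)


def min_coverage(n, alpha, p_min, p_max, grid_size=801):
    """Smallest coverage over an equidistant grid on [p_min, p_max]."""
    intervals = wilson_intervals(n, alpha)
    grid = [p_min + (p_max - p_min) * i / (grid_size - 1) for i in range(grid_size)]
    return min(coverage(p, n, intervals) for p in grid)


if __name__ == "__main__":
    for n in (50, 250):
        print(n, min_coverage(n, 0.05, 0.1, 0.9))
```

Restricting the grid to [0.1, 0.9] rather than [0.01, 0.99] only requires changing the two endpoint arguments.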

A coverage of 0.94 for a nominal 0.95 method is well below what one should
expect for sample sizes as large as n = 2000. If undercoverage of this size is
unacceptable, one may apply a computer-intensive coverage-adjustment method similar
to those discussed in Reiczigel (2003), decreasing $\alpha$ to some $\gamma$ for which the
minimum coverage over some set of values of p is at least $1-\alpha$, thus making the
methods exact. Decreasing $\alpha$ will however increase the expected length of the
intervals.

Comparing sample sizes of the $1-\gamma$ Jeffreys interval and the $1-\alpha$
Clopper-Pearson interval, we have:

$$n_+(d,p_0,\alpha,\gamma) \approx \frac{d + 2p_0q_0\left(z_{\alpha/2}^2 - 2z_{\gamma/2}^2\right) + 2z_{\alpha/2}\sqrt{z_{\alpha/2}^2p_0^2q_0^2 + p_0q_0d}}{d^2}.$$

For n between 1000 and 1500, computer-intensive adjustments of the Jeffreys
interval lead to $\gamma \approx 0.04$ (the actual $\gamma$ being somewhat larger than 0.04). For
$p_0 = 1/2$ and d = 0.04, we get $n_+(0.04, 1/2, 0.05, 0.04) \approx -186$, i.e. that the
Clopper-Pearson interval requires 186 observations fewer to obtain the desired
expected length. In general, approximate intervals that have been adjusted to be
exact are outperformed by the Clopper-Pearson interval.
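
The figure of roughly 186 observations can be checked with a short calculation. The sketch below rests on two assumptions that should be read as such: the Clopper-Pearson sample size is taken as the solution of $2n^{-1/2}z_{\alpha/2}(p_0q_0)^{1/2} + n^{-1} = d$, and the adjusted Jeffreys sample size is taken from the first-order length $2n^{-1/2}z_{\gamma/2}(p_0q_0)^{1/2} = d$. The function names are illustrative.

```python
import math
from statistics import NormalDist


def cp_sample_size(d, p0, alpha):
    """n solving 2 z (p0 q0 / n)^(1/2) + 1/n = d (second-order Clopper-Pearson length)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    pq = p0 * (1.0 - p0)
    return (d + 2.0 * pq * z * z + 2.0 * z * math.sqrt(z * z * pq * pq + pq * d)) / d ** 2


def jeffreys_sample_size(d, p0, gamma):
    """n solving 2 z (p0 q0 / n)^(1/2) = d (first-order length at level 1 - gamma)."""
    z = NormalDist().inv_cdf(1.0 - gamma / 2.0)
    return 4.0 * z * z * p0 * (1.0 - p0) / d ** 2


def extra_observations(d, p0, alpha, gamma):
    """Clopper-Pearson sample size minus adjusted Jeffreys sample size."""
    return cp_sample_size(d, p0, alpha) - jeffreys_sample_size(d, p0, gamma)


if __name__ == "__main__":
    # Negative values mean that the Clopper-Pearson interval needs fewer observations.
    print(extra_observations(0.04, 0.5, 0.05, 0.04))  # about -185.5
```

Under these assumptions the difference for d = 0.04, $p_0 = 1/2$, $\alpha = 0.05$ and $\gamma = 0.04$ comes out at about -185.5, in line with the value quoted above.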
Similarly, if one is willing to use approximate intervals, it is possible to apply
coverage adjustments to the Clopper-Pearson interval in order to adjust its mean
coverage to $1-\alpha$. The resulting $\gamma > \alpha$, meaning that the interval becomes
shorter after the adjustment. Thulin (2013) studied this problem in detail for
n < 100, showing that the adjusted Clopper-Pearson intervals often outperformed
their competitors.

It should be noted that criteria other than coverage and expected length
can be used for comparing confidence intervals. Newcombe (2011, 2012) compared
location properties, i.e. left and right non-coverage, of intervals and found the
Clopper-Pearson interval to have good properties in comparison to some
approximate intervals. Vos & Hudson (2005) considered two criteria related to p-values,
motivated by the interpretation of confidence intervals as inverted tests, and found
the Clopper-Pearson interval to be better than its competitors.



5.3 Conclusions 

When choosing between exact and approximate confidence methods, it is
important to be aware of the benefits and the costs associated with the two types of
methods. The coverage fluctuations of approximate intervals have been compared
in several studies, making it easy for practitioners to compare how costly these
intervals can be in terms of undercoverage. We have attempted to make the costs
of using exact methods explicit, by giving expressions for how much larger the
expected length of the exact intervals is and for how much the sample size increases
when a fixed expected length is to be attained.

For the two-sided Jeffreys interval, exactness comes at a fixed price: the cost of
using the Clopper-Pearson interval instead of this interval is, in terms of expected
length and required sample size, insensitive to p and $\alpha$. For the Agresti-Coull
interval, the cost only depends on $\alpha$. This stands in contrast to the Wilson score
interval and one-sided bounds, for which p and $\alpha$ can greatly affect the cost. In
either case the required sample sizes for the exact methods can be substantially
larger than those of the approximate methods.

In our comparison of exact and approximate methods, the only exact methods 
considered were the Clopper-Pearson interval and bound. While other shorter 
exact two-sided intervals exist, they suffer from various problems that make them 
unsuitable for use. Moreover, the Clopper-Pearson interval is used far more often 
than the other exact intervals, which merits its role as the main subject of this 
study. 
Appendix: Proofs

Theorem 1 follows directly from the following lemma, which is also used in the
proofs of Theorems 2 and 3.

Lemma 1. With assumptions and notation as in Theorem 1, the bounds of the
Clopper-Pearson interval are

$$p_L = \hat p - n^{-1/2} z_{\alpha/2}(\hat p\hat q)^{1/2} + (3n)^{-1}\left(2(1/2-\hat p)z_{\alpha/2}^2 - (1+\hat p)\right) - n^{-3/2} z_{\alpha/2}(\hat p\hat q)^{1/2}(\cdots) + O(n^{-2}),$$

$$p_U = \hat p + n^{-1/2} z_{\alpha/2}(\hat p\hat q)^{1/2} + (3n)^{-1}\left(2(1/2-\hat p)z_{\alpha/2}^2 + (1+\hat q)\right) + n^{-3/2} z_{\alpha/2}(\hat p\hat q)^{1/2}(\cdots) + O(n^{-2}).$$

The approximations are close to the actual bounds even for small sample sizes.
When n = 25 and p is not too close to 0 or 1, the approximations are typically
accurate up to at least two decimal places.

Proof of Lemma 1. First, we note that the lower limit of the Bayesian interval
with prior Beta(a, b), a, b > 0, is given by the beta quantile
$p_B(a,b,X,n) = B(\alpha/2, X+a, n-X+b)$.

For the Clopper-Pearson interval, $p_L$ is the beta quantile $B(\alpha/2, X, n-X+1)$.
When $X \notin \{0,n\}$ this can be written as $B(\alpha/2, (X-1)+1, (n-1)-(X-1)+1)$,
i.e. $p_B(1,1,X-1,n-1)$, the lower limit of the Beta(1, 1) interval for $X-1$ and
$n-1$.

An asymptotic expression for $p_B(a,b,X,n)$ in terms of
$\tilde p = (X+a-1)/(n+a+b-2)$ is given in expression (A.23) in Brown et al. (2002). We
obtain the asymptotic expansion of $p_L$ by taking the expansion for
$p_B(1,1,X-1,n-1)$ and rewriting the bound in terms of X/n, in a manner similar to
equation (A.26) of Brown et al. (2002). The expansion of $p_U$ is derived
analogously. □



Proof of Theorem 2. Using the expansion in Lemma 1, when $X \notin \{0,n\}$,

$$L_{CP} = p_U - p_L = 2n^{-1/2}z_{\alpha/2}(\hat p\hat q)^{1/2} + n^{-1} + n^{-3/2}m(\hat p) + R_n,$$

where



m (p) = (pq) 1/2 ^ (z 2 /2 + 2- llpq - 13pgz 2 /2 ) 



and $E(R_n) = O(n^{-2})$ by Theorem 7 of Brown et al. (2002). As the contribution to
expected length given by $X \in \{0,n\}$ is
$P(X \in \{0,n\}) \cdot (1-(\alpha/2)^{1/n}) = O((1/2)^n)$,
when computing $E(L_{CP})$ we can disregard the fact that the above expansion is
invalid for $X \in \{0,n\}$.

The $n^{-1/2}$ term is the length of the Wald interval, the expectation of which
was given in Brown et al. (2002):

$$E\left(2z_{\alpha/2}n^{-1/2}(\hat p\hat q)^{1/2}\right) = 2z_{\alpha/2}n^{-1/2}(pq)^{1/2}\left(1-(8npq)^{-1}\right) + O(n^{-2}).$$

$m(\hat p)$ is bounded when $X \notin \{0,n\}$ and $m(p)$ is twice differentiable for $0 < p < 1$.
Thus, by the theorem in Section 27.7 of Cramér (1946),

E(m(p)) = (pg)- 1/2 ^(^ /2 + 2 - 17 W - 13 M ^ /2 ) + 0{n- x ) 

and (6) follows after all terms of the same order are collected. □

Proof of Corollary 1. (8) and (9) are obtained by comparing (6) to the expansions
in Theorem 7 of Brown et al. (2002). In particular, compared to the length $L_{WS}$
of the Wilson score interval,

E{L CP ) =E{L ws )+n- 1 

z 



n 



-3/2_ 



,26 2\2 
9z(z+(-pq-- 



36(pq)V 2 

compared to the length Lac of the Agresti-Coull interval 
E{L CP ) =E(L AC )+n- 1 



Upq{\ - 2z 2 ) - A +0(n~ 2 ), 



n 



-3/2 _ 



36 (pg) 1/2 



,26 2\2 
9:(2: + (-pg-- 



+ pg(34 - 108z 2 



□ 



The proof of Theorem 3 is analogous to the proof of Theorem 2 and is
therefore omitted. It relies on the expansion for the expected distance of the
one-sided Wald bound found in Corollary 1 of Cai (2005).